An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Abstract

Criminal investigations collect and analyze the facts related to a crime, from which investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, because an annotated corpus for that purpose does not exist. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named-entities present in the crime-related documents. The corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and the identification of terms related to the criminal domain.
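
As a reading aid for the reported scores, the following is a minimal sketch of how precision, recall, and F1 are commonly computed for named-entity classification, and of how macro-averaging (scoring each group separately, then taking the mean) behaves. The abstract does not state how its means were taken, so the averaging scheme here is an assumption, and the entity labels and counts are hypothetical values chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts; a score is 0.0 when its denominator is 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(counts_by_label):
    """Average precision, recall and F1 over labels, each label weighted equally."""
    scores = [precision_recall_f1(*counts) for counts in counts_by_label.values()]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


def harmonic_mean(precision, recall):
    """F1 that would result from combining two already-averaged scores."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical (tp, fp, fn) counts per entity label; not taken from the paper.
example_counts = {
    "PERSON": (8, 2, 2),
    "LOCATION": (3, 1, 3),
}

macro_p, macro_r, macro_f1 = macro_average(example_counts)
print(f"macro precision={macro_p:.3f} recall={macro_r:.3f} F1={macro_f1:.3f}")

# Harmonic mean of the mean precision and recall reported in the abstract.
print(f"harmonic mean of 0.808 and 0.722: {harmonic_mean(0.808, 0.722):.3f}")
```

A mean of per-group F1 scores generally differs from the harmonic mean of the mean precision and mean recall. The harmonic mean of 0.808 and 0.722 is about 0.763, while the abstract reports 0.733, which is consistent with (though it does not prove) an F1 that was averaged over groups rather than derived from the two averaged scores.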

Similar resources

Development and implementation of an optimized control strategy for induction machine in an electric vehicle

In the area of automotive engineering there is a tendency toward more electrification of the power train. In this work, control of an induction machine for electric vehicle applications is investigated. As the operating point of the machine changes, adapting the rotor magnetization current seems to be useful for increasing the machine's efficiency. In the literature there are many approaches wh...

ZAC.PB: An Annotated Corpus for Zero Anaphora Resolution in Portuguese

This paper describes the methodology adopted in the construction of an annotated corpus for the study of zero anaphora in Portuguese, the ZAC corpus. To our knowledge, no such corpus exists at this time for the Portuguese language. The purpose of this linguistic resource is to promote the use of automatic discovery of linguistic parameters for anaphora resolution systems. Because of the complex...

TimeBankPT: A TimeML Annotated Corpus of Portuguese

In this paper, we introduce TimeBankPT, a TimeML annotated corpus of Portuguese. It has been produced by adapting an existing resource for English, namely the data used in the first TempEval challenge. TimeBankPT is the first corpus of Portuguese with rich temporal annotations (i.e. it includes annotations not only of temporal expressions but also about events and temporal relations). In additi...

Translation errors from English to Portuguese: an annotated corpus

Analysing translation errors is a task that can help us find and describe translation problems in greater detail, and can also suggest where the automatic engines should be improved. With these aims in mind, we have created a corpus composed of 150 sentences: 50 from the TAP magazine, 50 from a TED talk, and the other 50 from the TREC collection of factoid questions. We have a...

A Portuguese-Spanish Corpus Annotated for Subject Realization and Referentiality

This paper presents a comparable corpus of Portuguese and Spanish consisting of legal and health texts. We describe the annotation of zero subject, impersonal constructions and explicit subjects in the corpus. We annotated 12,492 examples using a scheme that distinguishes between different linguistic levels (phonology, syntax, semantics, etc.) and present a taxonomy of instances on which annota...

Journal

Journal title: Data

Year: 2021

ISSN: 2306-5729

DOI: https://doi.org/10.3390/data6070071